Submitted by Srimanth Agastyaraju (sragas) as part of the coursework for the course INFO-I 535, Fall 2022.
In this notebook, we create a pipeline to explore the homicide reports in the United States. To establish a pipeline, I've used IU's Jetstream 2 as the cloud infrastructure. I've implemented my pipeline primarily in PySpark, with Pandas to collect the summary queries and Plotly for visualization.
This project fulfills the following criteria:
Dataset Source: https://www.kaggle.com/datasets/murderaccountability/homicide-reports
Homicide is one of the leading causes of death. According to the CDC, All homicides - Number of deaths is 24,576 and Deaths per 100,000 population is 7.5. In this project, I hope to answer some questions by summarizing and visualizing the data. I've always been an avid reader of mystery novels. As a result, in the previous assignment "Data Analysis using PySpark", I chose to work on the Sherlock Holmes text data. In today's world of data analytics, equipped with the knowledge gained from this course, I believe I can play the role of a "Digital Detective".
This dataset contains 638454 rows and 24 columns. Most of the columns in the dataset are categorical, with a few exceptions. The dataset is very clean, except for a few zeros in the age columns, which should not be present. We address some of these challenges in this project.
The following steps were run:
sudo docker --version
nano docker-compose.yaml # Edited the file as outlined below
sudo docker-compose up # Starts the Jupyter Lab server
Contents of the docker-compose.yaml file
version: '3'
services:
spark:
image: jupyter/pyspark-notebook
ports:
- "8888:8888"
- "4040-4080:4040-4080"
volumes:
- ./notebooks:/home/jovyan/work/notebooks/?
The following python packages were installed in the environment:
pip3 install numpy pandas matplotlib seaborn plotly geopandas
In this pipeline, I have been able to answer the following questions:
NOTE: Plotly is a library to create interactive visualizations. In this notebook, you can interact with the plots by click and drag to zoom/pan, single click label to remove, and double click label to isolate.
from IPython.display import Image
Image(filename='J2.png')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly
import plotly.io as pio
import geopandas as gpd
from IPython.display import display, Markdown
plotly.offline.init_notebook_mode()
pio.templates.default = "simple_white"
# sns.set(rc={'figure.figsize':(11.7,8.27)})
# sns.set(font_scale=0.9)
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
spark = SparkContext()
sql_sc = SQLContext(spark)
spark
/usr/local/spark/python/pyspark/sql/context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
dataframe = sql_sc.read.csv("./Data/homicide_reports.csv", header=True)
dataframe.show(2)
+---------+-----------+-----------+----------------+---------+------+----+-------+--------+--------------------+------------+----------+----------+--------------------+----------------+---------------+---------------+--------------------+---------------------+------------+-------------+------------+-----------------+-------------+ |Record ID|Agency Code|Agency Name| Agency Type| City| State|Year| Month|Incident| Crime Type|Crime Solved|Victim Sex|Victim Age| Victim Race|Victim Ethnicity|Perpetrator Sex|Perpetrator Age| Perpetrator Race|Perpetrator Ethnicity|Relationship| Weapon|Victim Count|Perpetrator Count|Record Source| +---------+-----------+-----------+----------------+---------+------+----+-------+--------+--------------------+------------+----------+----------+--------------------+----------------+---------------+---------------+--------------------+---------------------+------------+-------------+------------+-----------------+-------------+ | 000001| AK00101| Anchorage|Municipal Police|Anchorage|Alaska|1980|January| 1|Murder or Manslau...| Yes| Male| 14|Native American/A...| Unknown| Male| 15|Native American/A...| Unknown|Acquaintance| Blunt Object| 0| 0| FBI| | 000002| AK00101| Anchorage|Municipal Police|Anchorage|Alaska|1980| March| 1|Murder or Manslau...| Yes| Male| 43| White| Unknown| Male| 42| White| Unknown|Acquaintance|Strangulation| 0| 0| FBI| +---------+-----------+-----------+----------------+---------+------+----+-------+--------+--------------------+------------+----------+----------+--------------------+----------------+---------------+---------------+--------------------+---------------------+------------+-------------+------------+-----------------+-------------+ only showing top 2 rows
CATEGORICAL_COLUMNS = list(dataframe.columns)
CATEGORICAL_COLUMNS.remove("Record ID")
CATEGORICAL_COLUMNS.remove("Agency Code")
CATEGORICAL_COLUMNS.remove("Agency Name")
CATEGORICAL_COLUMNS.remove("Victim Age")
CATEGORICAL_COLUMNS.remove("Perpetrator Age")
CATEGORICAL_COLUMNS.remove("City")
CATEGORICAL_COLUMNS.remove("Year")
CATEGORICAL_COLUMNS.remove("Incident")
def bar_or_pie_chart(df, x, y):
length = len(df)
if length > 5:
title = f"Count plot for the column {x}"
fig = px.bar(df, x=x, y=y, color=list(range(length)), color_continuous_scale="agsunset", title=title)
else:
title = f"Pie chart - {column}"
fig = px.pie(df, values=y, names=x, title=title)
fig.update_layout(width=1200, height=900)
fig.update_coloraxes(showscale=False)
fig.show()
# Top 20 Cities with the highest crime
count_query = dataframe.groupBy("City").count().sort(col("count").desc()).toPandas().iloc[:20, :]
bar_or_pie_chart(count_query, "City", "count")
for column in CATEGORICAL_COLUMNS:
count_query = dataframe.groupBy(column).count().sort(col("count").desc()).toPandas()
display(Markdown(f"#### Number of homicides per {column}"))
# sns.barplot(count_query, x=column, y="count").set_title(f"Count plot for the column {column}")
# plt.xticks(
# rotation=45,
# horizontalalignment='right',
# )
# plt.tight_layout()
# plt.show()
bar_or_pie_chart(count_query, column, "count")
query_df = dataframe.groupBy("Year").count().orderBy("Year").toPandas()
# sns.barplot(data=query_df, x="Year", y="count")
# plt.xticks(
# rotation=45,
# horizontalalignment='right',
# )
# plt.tight_layout()
title = "Total number of homicides each year"
fig = px.bar(query_df, x="Year", y="count", color=list(range(len(query_df))), color_continuous_scale="agsunset", title=title)
fig.add_traces(plotly.graph_objects.Scatter(x=list(query_df["Year"]), y=list(query_df["count"]), mode='lines'))
fig.update_layout(width=1500, height=800, showlegend=False)
fig.update_coloraxes(showscale=False)
fig.show()
query_df = dataframe.groupBy("Weapon", "Perpetrator Sex").count().sort(col("count").desc()).toPandas()
# sns.barplot(query_df, x="Weapon", y="count", hue="Perpetrator Sex")\
# .set_title(f"Murder Weapon counts - Solved and not solved")
# plt.xticks(
# rotation=45,
# horizontalalignment='right',
# )
# plt.tight_layout()
# fig = px.bar(query_df, x="Year", y="count", color=list(range(len(query_df))), color_continuous_scale=px.colors.sequential.Sunsetdark)
title = "Weapon of choice for different Perpetrator Sexes"
fig = px.bar(query_df, x="Weapon", y="count", color="Perpetrator Sex", barmode='group', title=title)
fig.update_layout(width=1400, height=800)
fig.show()
query_df = dataframe.groupBy("Weapon", "Crime Solved").count().sort(col("count").desc()).toPandas()
# sns.barplot(query_df, x="Weapon", y="count", hue="Crime Solved")\
# .set_title(f"Murder Weapon counts - Solved and not solved")
# plt.xticks(
# rotation=45,
# horizontalalignment='right',
# )
# plt.tight_layout()
title = "Weapon of choice help in Crime Solving"
fig = px.bar(query_df, x="Weapon", y="count", color="Crime Solved", barmode='group', title=title)
fig.update_layout(width=1400, height=800)
fig.show()
# Stranger's weapon of choice
stranger_df = dataframe.filter(dataframe["Relationship"] == "Stranger")
query_df = stranger_df.groupBy("Weapon").count().sort(col("count").desc()).toPandas()
# sns.barplot(query_df, x="Weapon", y="count").set_title(f"Weapon of choice for strangers")
# plt.xticks(
# rotation=45,
# horizontalalignment='right',
# )
# plt.tight_layout()
title = "Weapon of Choice for a Stranger"
fig = px.bar(query_df, x="Weapon", y="count", color=list(range(len(query_df))), color_continuous_scale="agsunset", title=title)
fig.update_layout(width=1400, height=800)
fig.update_coloraxes(showscale=False)
fig.show()
def age_to_cat_map(age):
age = int(age)
if age < 18:
return "Juvenile"
elif 18 <= age <= 39:
return "Young Adult"
elif 40 <= age <= 59:
return "Middle Aged Adults"
elif age>59:
return "Old"
perp_age_cat = dataframe.filter(dataframe["Perpetrator Age"] > 0)
bucket_udf = udf(age_to_cat_map, StringType())
perp_age_cat_pd = perp_age_cat.withColumn("Perpetrator Age Category", bucket_udf("Perpetrator Age"))\
.groupBy("Perpetrator Age Category").count().orderBy('count').toPandas()
# sns.barplot(perp_age_cat_pd, x="Perpetrator Age Category", y="count")\
# .set_title(f"Perpetrator age categories")
# plt.tight_layout()
title = "Age categories of a perpetrator"
fig = px.pie(perp_age_cat_pd, names="Perpetrator Age Category", values="count", title=title)
fig.update_coloraxes(showscale=False)
fig.update_layout(width=1200, height=900)
fig.show()
victim_age_cat = dataframe.filter(dataframe["Victim Age"] > 0)
bucket_udf = udf(age_to_cat_map, StringType())
victim_age_cat_pd = victim_age_cat.withColumn("Victim Age Category", bucket_udf("Victim Age"))\
.groupBy("Victim Age Category").count().orderBy('count').toPandas()
# sns.barplot(victim_age_cat_pd, x="Victim Age Category", y="count")\
# .set_title(f"Victim age categories")
# plt.tight_layout()
title = "Age categories of a victim"
fig = px.pie(victim_age_cat_pd, names="Victim Age Category", values="count", title=title)
fig.update_coloraxes(showscale=False)
fig.update_layout(width=1200, height=900)
fig.show()
non_zero_age_df = dataframe.filter((dataframe["Victim Age"] > 0) & (dataframe["Perpetrator Age"] > 0))
bucket_udf = udf(age_to_cat_map, StringType())
non_zero_age_df = non_zero_age_df.withColumn("Perp Older", non_zero_age_df["Perpetrator Age"] > non_zero_age_df["Victim Age"])
bool_to_conf_udf = udf(lambda x: "Yes" if x is True else "No", StringType())
non_zero_age_df = non_zero_age_df.withColumn("Perp Older", bool_to_conf_udf("Perp Older"))
perp_older_df = non_zero_age_df.groupBy("Perp Older").count().toPandas()
fig = px.pie(perp_older_df, values='count', names='Perp Older', title="Is the Perpetrator older than the Victim?")
fig.update_layout(width=1200, height=900)
fig.show()
# Is the perpetrator of the same race as the victim?
same_race_df = non_zero_age_df.withColumn("Perp Same Race", non_zero_age_df["Perpetrator Race"] == non_zero_age_df["Victim Race"])
bool_to_conf_udf = udf(lambda x: "Yes" if x is True else "No", StringType())
same_race_df = same_race_df.withColumn("Perp Same Race", bool_to_conf_udf("Perp Same Race"))
same_race_df = same_race_df.groupBy("Perp Same Race").count().toPandas()
fig = px.pie(same_race_df, values='count', names='Perp Same Race', title="Is the Perpetrator of the same race as the victim?")
fig.update_layout(width=1200, height=900)
fig.show()
crimes_per_city = dataframe.groupBy("City").count().toPandas()
crimes_per_city = crimes_per_city.rename(columns={"City": "NAME", "count": "Crimes Per City"})
usa = gpd.read_file('https://raw.githubusercontent.com/holtzy/The-Python-Graph-Gallery/master/static/data/US-counties.geojson')
result = pd.merge(usa, crimes_per_city, on="NAME", how="inner")
result["Crimes Per City Log10"] = np.log10(result["Crimes Per City"])
title="Geo Heatmap of Crimes in USA in every county - Log10 Scaled."
fig = px.choropleth(result, geojson=result, locations='COUNTY', color="Crimes Per City Log10",
color_continuous_scale="sunset",
# showcountries=True, countrycolor="Black",
# showsubunits=True, subunitcolor="Blue",
scope="usa", title=title)
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(width=1400, height=700)
fig.update_geos(showsubunits=True, subunitcolor="black")
fig.show()